Module 02 - Python for LLM Engineering
Reading time: ~20 minutes | Level: Advanced
The Gap Between a Demo and a Product
You have called openai.chat.completions.create() before. You pasted in a prompt. The model replied. It worked.
Now imagine you are three months into building a product on top of that API call. Here is what your codebase looks like now:
# Version 1: The demo (day 1)
def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
# Version 2: The production system (month 3)
# Helpers prefixed with an underscore are application code, not shown here.
from collections.abc import AsyncIterator

async def generate(
    prompt: str,
    *,
    system: str | None = None,
    max_tokens: int = 2048,
    temperature: float = 0.7,
    model: str = "gpt-4o",
    user_id: str | None = None,
) -> AsyncIterator[str]:
    messages = _build_messages(system, prompt, history=await _load_history(user_id))
    _check_token_budget(messages, max_tokens, model)
    parts: list[str] = []  # accumulate the reply so it can be persisted afterwards
    async with _rate_limiter:
        async for chunk in _stream_with_retry(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        ):
            _log_chunk(chunk, user_id)
            text = chunk.delta.content or ""
            parts.append(text)
            yield text
    await _persist_turn(user_id, prompt, "".join(parts))
    await _update_cost_ledger(user_id, _count_tokens(messages, model))
That gap is what this module teaches. Not LLM theory. Not prompt writing tips. The engineering required to put an LLM in production and keep it there.
What Makes LLM Engineering Hard
Calling a REST API sounds trivial. But LLM APIs violate almost every assumption you have about how APIs behave.
They are stateless but your users expect state. The model has no memory. Every request must carry the full conversation history. You manage that history, decide what to include, and handle context windows that overflow.
They are non-deterministic. The same input can produce a different output on each call. You cannot cache responses the way you cache database queries, and testing requires probabilistic approaches.
They are slow. A GPT-4o call can take roughly 2-30 seconds depending on output length. That latency is unacceptable in a synchronous web handler. You need streaming, async, and careful UX design.
They fail in unexpected ways. Rate limits hit mid-conversation. The API returns a partial JSON object. A tool call arrives split across two stream chunks. Connection drops after 40 tokens. Every failure mode is different.
They are expensive and hard to meter. Cost is per token. A single runaway prompt loop can burn hundreds of dollars in minutes. You need token budgets, cost tracking, and circuit breakers.
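One way to picture a cost circuit breaker is a small ledger that refuses any call that would push spend past a limit. This is a sketch under stated assumptions: CostBudget and BudgetExceeded are hypothetical names, and the flat per-token price is invented for illustration rather than taken from any provider's price list.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the configured limit."""


class CostBudget:
    """Tracks estimated spend and refuses charges once the limit would be exceeded."""

    def __init__(self, limit_usd: float, usd_per_1k_tokens: float):
        self.limit_usd = limit_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens  # assumed flat price, for illustration
        self.spent_usd = 0.0

    def charge(self, tokens: int) -> float:
        """Record the cost of `tokens`, or raise before the limit is crossed."""
        cost = tokens / 1000 * self.usd_per_1k_tokens
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + cost:.2f} of {self.limit_usd:.2f} USD"
            )
        self.spent_usd += cost
        return cost


budget = CostBudget(limit_usd=1.00, usd_per_1k_tokens=0.01)  # hypothetical price
budget.charge(50_000)  # allowed
budget.charge(40_000)  # allowed
try:
    budget.charge(20_000)  # would exceed the 1.00 USD limit
    tripped = False
except BudgetExceeded:
    tripped = True
```

Checking before the call, rather than after, is what stops a runaway loop: the request that would break the budget is never sent.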
The output is unstructured. Even with JSON mode, models occasionally produce invalid JSON, truncated arrays, or hallucinated field names. Parsing LLM output requires defensive code.
This module addresses all of these problems systematically.
What You Will Build
Across six lessons, you will construct the Python infrastructure that underlies every serious LLM application:
| Lesson | What You Build |
|---|---|
| 01 -- Calling LLM APIs | Production API client with retry, rate limiting, cost tracking, and structured output |
| 02 -- Streaming | Async streaming pipeline from API to FastAPI endpoint to browser |
| 03 -- Prompt Templates | Versioned prompt system with Jinja2, validation, and injection defense |
| 04 -- Token Counting | tiktoken-based budget manager, context truncation strategies, sliding window |
| 05 -- Tool Use | Python function dispatcher, schema generation, multi-step agent loop |
| 06 -- Vector Search | Embedding pipeline, FAISS index, retrieval-augmented generation (RAG) |
By the end, you will have a working skeleton of a production RAG chatbot with tool use, streaming, and cost tracking.
A Map of a Production LLM System
Before diving into individual lessons, step back and see the whole picture. Here is the Python layer of a production LLM system:
Each box in this diagram corresponds to code you will write in this module. The lessons are ordered so that each one builds on the last.
Mental Models You Need
Tokens Are Not Characters
Confusing the two is one of the most common mistakes beginners make. Models do not see text. They see tokens -- integer IDs representing subword units. The word "python" is typically one token. The word "unbelievable" might be one, two, or three tokens depending on the tokenizer. A space before a word is often a different token than the same word without a space.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Python is unbelievable."
tokens = enc.encode(text)
print(tokens)       # a list of integer token IDs; exact values depend on the encoding
print(len(tokens))  # the token count, well below the 23 characters in the text
Why it matters:
- API costs are per token, not per character
- Context window limits are in tokens (128K tokens is not 128K characters)
- Token counting must happen before every API call to stay within budget
- Truncation must happen at token boundaries, not character boundaries
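Truncating at token boundaries means slicing the list of token IDs and decoding the result, never slicing the string. The sketch below takes the encoder and decoder as parameters; truncate_to_budget and the toy word-level tokenizer are assumptions that let the example run offline, not part of any library.

```python
from typing import Callable, Sequence


def truncate_to_budget(
    text: str,
    max_tokens: int,
    encode: Callable[[str], list[int]],
    decode: Callable[[Sequence[int]], str],
) -> tuple[str, int]:
    """Cut text to at most max_tokens tokens; return (text, token_count)."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens)
    kept = tokens[:max_tokens]  # slice the token list, never the string
    return decode(kept), len(kept)


# Toy word-level tokenizer, used only so the demo needs no downloads.
_vocab: list[str] = []


def toy_encode(text: str) -> list[int]:
    ids = []
    for word in text.split(" "):
        if word not in _vocab:
            _vocab.append(word)
        ids.append(_vocab.index(word))
    return ids


def toy_decode(ids: Sequence[int]) -> str:
    return " ".join(_vocab[i] for i in ids)


short, count = truncate_to_budget("one two three four five", 3, toy_encode, toy_decode)
```

With tiktoken you would pass enc.encode and enc.decode from an encoding object in place of the toy functions.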
Context Windows Are Finite Queues
Every model has a maximum context window: the total number of tokens it can process in a single call (input + output combined). As of 2025, many frontier models have 128K-200K token windows, and some are larger. That sounds large until you are building a multi-document RAG system, where a handful of long documents plus the conversation history can fill it.
The mental model: the context window is a sliding window over an infinite conversation. When it fills up, you must decide what to drop. This is not automatic. You must implement the eviction policy.
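The simplest eviction policy drops the oldest turns first while always keeping the system message. The sketch below is one possible policy, not the only one: evict_oldest is a hypothetical helper, the token counter is passed in, and the demo counts words as a stand-in for real tokens. It also ignores any per-message overhead that a real API adds.

```python
def evict_oldest(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the total fits max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(count_tokens(m["content"]) for m in system + rest)
    while rest and total > max_tokens:
        dropped = rest.pop(0)  # the oldest turn goes first
        total -= count_tokens(dropped["content"])
    return system + rest


def count_words(text):
    """Word count as a rough stand-in for a real tokenizer."""
    return len(text.split())


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
]
trimmed = evict_oldest(history, max_tokens=6, count_tokens=count_words)
```

Other policies, such as summarizing old turns or keeping only the most relevant ones, trade simplicity for better recall of earlier context.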
The orange "Retrieved Docs" region is the one you control most directly. The retrieval strategy determines what goes here and in what priority order.
Embeddings Are Vectors in Semantic Space
An embedding model converts text into a dense vector of floating-point numbers (typically 768-3072 dimensions). Texts with similar meaning have vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A feline rested on a rug" are semantically similar -- their embeddings will have high cosine similarity.
# Conceptually (illustrative values; real embeddings have hundreds or thousands of dimensions)
embedding("The cat sat on the mat")    # -> [0.12, -0.34, 0.89, ...]
embedding("A feline rested on a rug")  # -> [0.11, -0.31, 0.91, ...]   close to the first
embedding("Stock prices fell 3%")      # -> [-0.67, 0.22, -0.44, ...]  far from both
This is the foundation of RAG: embed your documents, store the vectors in a vector database, embed the user's query, find the nearest document vectors, and inject those documents into the LLM's context. The LLM can then ground its answer in retrieved facts rather than relying only on what it absorbed during training, which reduces hallucination.
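The "find the nearest document vectors" step can be sketched with plain cosine similarity. The three-dimensional vectors below are made up for illustration and reuse the conceptual values above; real vectors come from an embedding model, and a library such as FAISS replaces the linear scan at scale. The names top_k and docs are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the k document keys most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors; not the output of any real embedding model.
docs = {
    "cat_on_mat": [0.12, -0.34, 0.89],
    "stock_prices": [-0.67, 0.22, -0.44],
}
query = [0.11, -0.31, 0.91]  # stands in for "A feline rested on a rug"
best = top_k(query, docs, k=1)
```

In a RAG pipeline, the text behind the keys returned by top_k is what gets injected into the prompt.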
Streaming Is a Protocol, Not a Feature
When you enable streaming, the API does not wait for the full response before sending anything. It sends tokens as they are generated, using HTTP chunked transfer encoding or Server-Sent Events. Your Python code receives a stream of deltas -- tiny JSON objects each containing a fragment of the response.
# Non-streaming: one response object, arrives after full generation
response = client.chat.completions.create(model="gpt-4o", messages=[...])
print(response.choices[0].message.content)  # The whole thing at once

# Streaming: many delta objects, arrive as they are generated
stream = client.chat.completions.create(model="gpt-4o", messages=[...], stream=True)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)  # Print each fragment immediately
The moment you add tool use to your streaming pipeline, things get more complex: tool call arguments arrive in fragments too, and you cannot dispatch the tool until you have reassembled the complete JSON arguments.
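Reassembly can be sketched as concatenating argument fragments per tool-call index and parsing the JSON only once the stream has ended. The fragments here are simplified plain dicts, which is an assumption for the example: SDK stream objects expose similar fields as attributes, and assemble_tool_calls is a hypothetical helper.

```python
import json


def assemble_tool_calls(deltas):
    """Merge streamed tool-call fragments, keyed by index, into complete calls."""
    calls = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        call["name"] += delta.get("name", "")
        call["arguments"] += delta.get("arguments", "")  # raw JSON text, possibly partial
    # Parse only after every fragment has been appended.
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"])}
        for _, call in sorted(calls.items())
    ]


# Simplified fragments; the argument JSON is split mid-string on purpose.
fragments = [
    {"index": 0, "name": "get_weather", "arguments": ""},
    {"index": 0, "arguments": '{"city": "Par'},
    {"index": 0, "arguments": 'is", "unit": "C"}'},
]
calls = assemble_tool_calls(fragments)
```

Calling json.loads on any single fragment would fail, which is why dispatch has to wait until the arguments are complete.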
How This Module Fits the Learning Path
This module assumes you have completed:
- Python Foundation: functions, classes, exceptions, file I/O
- Python Intermediate: async/await, asyncio, generators, context managers, type hints
- Python Advanced Module 1: scientific Python stack (NumPy, Pandas, PyTorch basics)
It feeds into:
- Module 3 -- ML Engineering: building training pipelines with LLM-generated data, evaluation harnesses
- Module 4 -- MLOps: deploying LLM applications, A/B testing prompts in production, model versioning
If you have completed the Python Advanced track (metaprogramming, async, performance), many patterns in this module will feel familiar. You will recognize async generators in the streaming lesson, context managers in the API client patterns, and dataclasses in the prompt template system.
Prerequisites Checklist
Before starting Lesson 01, verify you can answer these questions without looking them up:
- What is async def and when do you need it instead of def?
- What does await do, and what can you await?
- What is AsyncIterator[str] and how do you consume one with async for?
- What is a context manager (with statement) and how does __enter__/__exit__ work?
- What is a dataclass and how does it differ from a plain class?
- What does @retry from the tenacity library do?
If any of these are unclear, review the relevant Python Intermediate or Advanced lessons before continuing.
Environment Setup
All lessons in this module use these packages. Install them now:
pip install anthropic openai tiktoken tenacity httpx fastapi uvicorn \
jinja2 pydantic faiss-cpu numpy sentence-transformers
You will need API keys from Anthropic and/or OpenAI. Store them as environment variables:
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
Or use a .env file with python-dotenv (install it separately with pip install python-dotenv):
from dotenv import load_dotenv
load_dotenv() # reads .env into os.environ
Never hardcode API keys in source code. Never commit them to version control. Use .env files locally (and add .env to .gitignore) and environment variables in production. A key leaked in a public GitHub repository can be found by automated scanners within minutes.
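A small guard at startup makes a missing key fail fast with a readable message instead of surfacing later as an authentication error. This is a minimal sketch; require_env is a hypothetical helper name, not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable's value, or fail fast with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# Typical use at application startup:
# api_key = require_env("OPENAI_API_KEY")
```

Because the error message contains only the variable name, the key itself never ends up in logs or tracebacks.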
Key Takeaways
- The distance between an LLM demo and a production LLM system is measured in retry logic, token budgets, streaming pipelines, and cost tracking.
- Tokens are not characters. Always count tokens before calling the API, not after.
- The context window is a finite resource you manage. Eviction policy is your responsibility.
- Embeddings map text to vectors in semantic space. Closeness in vector space means semantic similarity.
- Streaming is a chunked protocol. Tool calls in streams must be reassembled before dispatch.
- This module builds the Python infrastructure for a production LLM system, lesson by lesson.
Start with Lesson 01 -- Calling LLM APIs.
